-
Notifications
You must be signed in to change notification settings - Fork 0
Conversation
… deadlocks on MySQL (#80329) * Split subquery when cleaning annotations * update comment * Raise batch size, now that we pay attention to it * Iterate in batches * Separate cancellable batch implementation to allow for multi-statement callbacks, add overload for single-statement use * Use split-out utility in outer batching loop so it respects context cancellation * guard against empty queries * Use SQL parameters * Use same approach for tags * drop unused function * Work around parameter limit on sqlite for large batches * Bulk insert test data in DB * Refactor test to customise test data creation * Add test for catching SQLITE_MAX_VARIABLE_NUMBER limit * Turn annotation cleanup test to integration tests * lint --------- Co-authored-by: Sofia Papagiannaki <1632407+papagian@users.noreply.github.com>
📝 WalkthroughWalkthroughRefactors annotation cleanup logic to process deletions in batches with context cancellation support, avoiding deadlocks under concurrent load. Updates cleanup trigger frequency from 10 to 1 minute and introduces integration tests with database state cleanup and batched insertion patterns. Changes
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Poem
🚥 Pre-merge checks | ✅ 1 | ❌ 1❌ Failed checks (1 inconclusive)
✅ Passed checks (1 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@pkg/services/annotations/annotationsimpl/xorm_store.go`:
- Around line 534-538: The current routine cleanup logging uses r.log.Error for
normal batch progress (calls like r.log.Error("Annotations to clean by time",
... ) and r.log.Error("cleaned annotations by time", ... )) which will flood
error alerts; change these to r.log.Info or r.log.Debug as appropriate, stop
logging the ids slice (log only its length or sample) and remove the redundant
"err" field where err is known to be nil before the log; apply the same change
pattern to the other occurrences around the deleteByIDs calls (references:
r.log.Error, deleteByIDs, the "Annotations to clean by time" and "cleaned
annotations by time" messages) so only real errors use r.log.Error and routine
progress logs use Info/Debug with compact payloads.
In `@pkg/services/cleanup/cleanup.go`:
- Line 77: The ticker interval in Run (ticker := time.NewTicker(...)) is 1
minute while clean() uses a 9-minute context timeout, causing
overlapping/back-to-back runs; update them to be consistent by either restoring
the ticker to the previous longer interval (e.g., set ticker :=
time.NewTicker(time.Minute * 10)) or reducing the context timeout inside clean()
to something under 1 minute (e.g., <1m) so the context timeout is always shorter
than the ticker interval; ensure you update both the ticker declaration and the
corresponding timeout logic used by clean() (and similar logic referenced in
lines 88-94) so they remain aligned.
🧹 Nitpick comments (3)
pkg/services/annotations/annotationsimpl/cleanup_test.go (1)
237-271: Manual ID assignment couples test setup to DB internals.Line 241 sets
ID: int64(i + 1), which relies on the DB accepting explicitly provided IDs and not conflicting with auto-increment sequences. This works for SQLite but may break if the test DB engine changes or if rows from a previous subtest leave the auto-increment counter advanced (thoughDELETEdoesn't reset it). Since the IDs are also baked intonewAnnotationTagsat line 270, any mismatch would silently produce orphaned tags.Consider either removing explicit ID assignment and querying back for inserted IDs, or documenting this assumption.
pkg/services/annotations/annotationsimpl/xorm_store.go (2)
536-538: Variable namesx,yare cryptic for returned values.
x, ydon't convey meaning. Useaffected, err(orbatchAffected, batchErrto avoid shadowing) for readability.Example fix for one occurrence
- x, y := r.deleteByIDs(ctx, "annotation", ids) - r.log.Error("cleaned annotations by time", "count", len(ids), "affected", x, "err", y) - return x, y + batchAffected, batchErr := r.deleteByIDs(ctx, "annotation", ids) + r.log.Debug("cleaned annotations by time", "count", len(ids), "affected", batchAffected, "err", batchErr) + return batchAffected, batchErrAlso applies to: 556-558, 578-580
605-618: SQLite ID-inlining branch uses O(n²) string concatenation.Lines 609–612 build the ID list via repeated
fmt.Sprintf("%s, %d", ...), which creates a new string allocation on every iteration. With the test batch size of 32767, this becomes expensive. Usestrings.Builderorstrings.Joininstead.Also, the condition at line 608 checks
r.cfg.AnnotationCleanupJobBatchSize > sqliteParameterLimit(the configured maximum), notlen(ids)(the actual count). This means even a batch of 5 IDs would take the inlining path if the configured batch size exceeds 999.Proposed fix using strings.Builder
- if r.db.GetDBType() == migrator.SQLite && r.cfg.AnnotationCleanupJobBatchSize > sqliteParameterLimit { - values := fmt.Sprint(ids[0]) - for _, v := range ids[1:] { - values = fmt.Sprintf("%s, %d", values, v) - } - sql = fmt.Sprintf(`DELETE FROM %s WHERE id IN (%s)`, table, values) + if r.db.GetDBType() == migrator.SQLite && len(ids) > sqliteParameterLimit { + var b strings.Builder + b.WriteString(strconv.FormatInt(ids[0], 10)) + for _, v := range ids[1:] { + b.WriteByte(',') + b.WriteString(strconv.FormatInt(v, 10)) + } + sql = fmt.Sprintf(`DELETE FROM %s WHERE id IN (%s)`, table, b.String())
| r.log.Error("Annotations to clean by time", "count", len(ids), "ids", ids, "cond", cond, "err", err) | ||
|
|
||
| affected, err := r.executeUntilDoneOrCancelled(ctx, sql) | ||
| x, y := r.deleteByIDs(ctx, "annotation", ids) | ||
| r.log.Error("cleaned annotations by time", "count", len(ids), "affected", x, "err", y) | ||
| return x, y |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
r.log.Error used for routine operational logging — will pollute error logs.
These log lines (lines 534, 537, 554, 557, 576, 579) all use Error level for what is normal batch-progress logging. In production, this will flood error dashboards/alerts every cleanup cycle. Use Debug or Info level instead.
Proposed fix (showing pattern for all occurrences)
- r.log.Error("Annotations to clean by time", "count", len(ids), "ids", ids, "cond", cond, "err", err)
+ r.log.Debug("Annotations to clean by time", "count", len(ids), "cond", cond)
- r.log.Error("cleaned annotations by time", "count", len(ids), "affected", x, "err", y)
+ r.log.Debug("cleaned annotations by time", "count", len(ids), "affected", x, "err", y)
- r.log.Error("Annotations to clean by count", "count", len(ids), "ids", ids, "cond", cond, "err", err)
+ r.log.Debug("Annotations to clean by count", "count", len(ids), "cond", cond)
- r.log.Error("cleaned annotations by count", "count", len(ids), "affected", x, "err", y)
+ r.log.Debug("cleaned annotations by count", "count", len(ids), "affected", x, "err", y)
- r.log.Error("Tags to clean", "count", len(ids), "ids", ids, "cond", cond, "err", err)
+ r.log.Debug("Tags to clean", "count", len(ids), "cond", cond)
- r.log.Error("cleaned tags", "count", len(ids), "affected", x, "err", y)
+ r.log.Debug("cleaned tags", "count", len(ids), "affected", x, "err", y)Additionally, note that lines 534 and 554 log "err", err where err is always nil at that point (it was already checked on line 531/551 and would have returned). The ids slice is also logged, which can be enormous for large batches — consider omitting it or logging only the count.
Also applies to: 554-558, 576-579
🤖 Prompt for AI Agents
In `@pkg/services/annotations/annotationsimpl/xorm_store.go` around lines 534 -
538, The current routine cleanup logging uses r.log.Error for normal batch
progress (calls like r.log.Error("Annotations to clean by time", ... ) and
r.log.Error("cleaned annotations by time", ... )) which will flood error alerts;
change these to r.log.Info or r.log.Debug as appropriate, stop logging the ids
slice (log only its length or sample) and remove the redundant "err" field where
err is known to be nil before the log; apply the same change pattern to the
other occurrences around the deleteByIDs calls (references: r.log.Error,
deleteByIDs, the "Annotations to clean by time" and "cleaned annotations by
time" messages) so only real errors use r.log.Error and routine progress logs
use Info/Debug with compact payloads.
| @@ -74,7 +74,7 @@ func (j cleanUpJob) String() string { | |||
| func (srv *CleanUpService) Run(ctx context.Context) error { | |||
| srv.cleanUpTmpFiles(ctx) | |||
|
|
|||
| ticker := time.NewTicker(time.Minute * 10) | |||
| ticker := time.NewTicker(time.Minute * 1) | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ticker interval (1 min) vs. cleanup timeout (9 min) mismatch.
The ticker now fires every 1 minute, but clean() still uses a 9-minute context timeout. Since Run is single-threaded, a long-running cleanup will cause the next tick to fire immediately upon completion, leading to back-to-back cleanup cycles. The previous 10-minute interval matched the 9-minute timeout, leaving a 1-minute buffer. Consider reducing the timeout to something below 1 minute, or keeping the ticker interval above the timeout.
Also applies to: 88-94
🤖 Prompt for AI Agents
In `@pkg/services/cleanup/cleanup.go` at line 77, The ticker interval in Run
(ticker := time.NewTicker(...)) is 1 minute while clean() uses a 9-minute
context timeout, causing overlapping/back-to-back runs; update them to be
consistent by either restoring the ticker to the previous longer interval (e.g.,
set ticker := time.NewTicker(time.Minute * 10)) or reducing the context timeout
inside clean() to something under 1 minute (e.g., <1m) so the context timeout is
always shorter than the ticker interval; ensure you update both the ticker
declaration and the corresponding timeout logic used by clean() (and similar
logic referenced in lines 88-94) so they remain aligned.
This pull request was automatically created by
@coderabbitai/e2e-reviewer.Batch created pull request.
Summary by CodeRabbit
Bug Fixes
Chores